EDA and Methods

Author

Hermela Shimelis

Published

November 4, 2024

Data: Survival after chemotherapy for Stage B/C colon cancer

The data was identified from survival R package

Description

These are data from one of the first successful trials of adjuvant chemotherapy for colon cancer. Levamisole is a low-toxicity compound previously used to treat worm infestations in animals; 5-FU is a moderately toxic (as these things go) chemotherapy agent. There are two records per person, one for recurrence and one for death.

The purpose of this project is to compare survival between the untreated (Obs) group vs those treated with amisole (Lev), or amisole + 5-FU.

Column names:

id: id
study: 1 for all patients
rx: Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU
sex: 1=male
age: in years
obstruct: obstruction of colon by tumour
perfor: perforation of colon
adhere: adherence to nearby organs
nodes: number of lymph nodes with detectable cancer
time: days until event or censoring
status: censoring status
differ: differentiation of tumour (1=well, 2=moderate, 3=poor)
extent: Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg: time from surgery to registration (0=short, 1=long)
node4: more than 4 positive lymph nodes
etype: event type: 1=recurrence,2=death
Code
# Load library
library(dplyr)
library(survival)
library(janitor)
library(magrittr)
library(car)
library(ggplot2)
library(tidyverse)
library(broom)
library(MASS)
library(boot)

#print(citation("boot"), bibtex=TRUE)
Code
#Load data
colon <- as_tibble(colon)
head(colon)
# A tibble: 6 × 16
     id study rx        sex   age obstruct perfor adhere nodes status differ
  <dbl> <dbl> <fct>   <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1     1     1 Lev+5FU     1    43        0      0      0     5      1      2
2     1     1 Lev+5FU     1    43        0      0      0     5      1      2
3     2     1 Lev+5FU     1    63        0      0      0     1      0      2
4     2     1 Lev+5FU     1    63        0      0      0     1      0      2
5     3     1 Obs         0    71        0      0      1     7      1      2
6     3     1 Obs         0    71        0      0      1     7      1      2
# ℹ 5 more variables: extent <dbl>, surg <dbl>, node4 <dbl>, time <dbl>,
#   etype <dbl>

Since the current analysis is focused on survival, filter data to death as the event type. This will create a data table with one row per individual.

Code
colon_surv <- colon%>%filter(etype == 2) 

# Identify participants with outome = death (not censored)
colon_surv$death <- if_else(colon_surv$status == 1, 1, 0) 

Identify participants who had recurrence. Identify those not censored for recurrence event. Filter event type = 1 (recurrence), status = 1.

Code
recurrence <- colon%>%filter(etype == 1 & status == 1)%>%dplyr::select(id)
recurrence <- recurrence%>%mutate(recurrence = 1) # list of patients with recurrence


colon_surv <- colon_surv%>%merge(recurrence, by = "id", all.x = TRUE)
colon_surv$recurrence[is.na(colon_surv$recurrence)] <- 0

I. Exploratory data analysis

Check missing values

Code
na_counts <- sapply(colon_surv, function(x)sum(is.na(x)))
na_counts
        id      study         rx        sex        age   obstruct     perfor 
         0          0          0          0          0          0          0 
    adhere      nodes     status     differ     extent       surg      node4 
         0         18          0         23          0          0          0 
      time      etype      death recurrence 
         0          0          0          0 
Code
# replace NAs with mode
table(colon_surv$differ)

  1   2   3 
 93 663 150 
Code
mode(colon_surv$differ)
[1] "numeric"
Code
median(colon_surv$nodes, na.rm= TRUE)
[1] 2
Code
colon_surv$differ <- if_else(is.na(colon_surv$differ), 2,colon_surv$differ)
colon_surv$nodes <- if_else(is.na(colon_surv$nodes), 2,colon_surv$nodes)

Insight: only nodes and differ columns have NA values. Replacing the 23 NAs in differ column with mode, and replace NAs in nodes with median.

Evaluate continuous variables

Code
# age
hist(colon_surv$age)

Code
hist(colon_surv$nodes)

Code
hist(colon_surv$time)

Insight: Age is normally distribute. Number of nodes is skewed to the right. Time is fairly normally distributed with most the individuals had event time between 500-3000 days.

Evaluate nodes column to investigate outliers

Code
t <- colon_surv%>%filter(node4 ==1) # samples with more than positive lymph nodes
hist(t$nodes) 

Insight: samples with greater than 4 lymph nodes have less than 5 count in nodes column, so the two columns are not consistent. Therefore, nodes column will not be used for further analysis.

Evaluate categorical variables

Code
summary_table <- colon_surv%>%summarise(count =n(),
                                        male = sum(sex), 
                                                       median_age = median(age),
                                                       ct_perforation = sum(perfor),
                                                       ct_adherence_nerby_organ = sum(adhere),
                                                       ct_death = sum(death))

summary_table
  count male median_age ct_perforation ct_adherence_nerby_organ ct_death
1   929  484         61             27                      135      452

Insight: Total number of participants: 929. About half of the participants are male and about half were censored, while the other half died.

Code
colon_surv <- colon_surv%>%mutate(differentiation = case_when(differ == 1 ~ "well",
                                                              differ == 2 ~ "moderate",
                                                              differ == 3 ~ "poor"),
                                  local_spread = case_when(extent == 1 ~ "submucosa",
                                                           extent == 2 ~ "muscle",
                                                           extent == 3 ~ "serosa",
                                                           extent == 4 ~ "contiguous"),
                                  surg_to_reg_time = case_when(surg == 0~ "short",
                                                               surg == 1 ~ "long"))

Frequency tables for categorical variables

Code
# frequency tables for categorical variables
# Tumor differentiation

colon_surv %>% 
  tabyl(differentiation, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 differentiation         Obs         Lev     Lev+5FU
        moderate 74.9% (236) 73.9% (229) 72.7% (221)
            poor 16.5%  (52) 14.2%  (44) 17.8%  (54)
            well  8.6%  (27) 11.9%  (37)  9.5%  (29)
Code
# extent of local spread
colon_surv %>% 
  tabyl(local_spread, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 local_spread         Obs         Lev     Lev+5FU
   contiguous  6.3%  (20)  3.9%  (12)  3.6%  (11)
       muscle 12.1%  (38) 11.6%  (36) 10.5%  (32)
       serosa 79.0% (249) 83.5% (259) 82.6% (251)
    submucosa  2.5%   (8)  1.0%   (3)  3.3%  (10)
Code
# colum obstruction
colon_surv %>% 
  tabyl(obstruct, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 obstruct         Obs         Lev     Lev+5FU
        0 80.0% (252) 79.7% (247) 82.2% (250)
        1 20.0%  (63) 20.3%  (63) 17.8%  (54)
Code
# colon perforation
colon_surv %>% 
  tabyl(perfor, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 perfor         Obs         Lev     Lev+5FU
      0 97.1% (306) 96.8% (300) 97.4% (296)
      1  2.9%   (9)  3.2%  (10)  2.6%   (8)
Code
# Adherance to nearby organs
colon_surv %>% 
  tabyl(adhere, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 adhere         Obs         Lev     Lev+5FU
      0 85.1% (268) 84.2% (261) 87.2% (265)
      1 14.9%  (47) 15.8%  (49) 12.8%  (39)
Code
# extent of local tumor spread
colon_surv %>% 
  tabyl(local_spread, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 local_spread         Obs         Lev     Lev+5FU
   contiguous  6.3%  (20)  3.9%  (12)  3.6%  (11)
       muscle 12.1%  (38) 11.6%  (36) 10.5%  (32)
       serosa 79.0% (249) 83.5% (259) 82.6% (251)
    submucosa  2.5%   (8)  1.0%   (3)  3.3%  (10)
Code
# More than 4 lymph nodes with cancer
colon_surv %>% 
  tabyl(node4, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 node4         Obs         Lev     Lev+5FU
     0 72.4% (228) 71.3% (221) 74.0% (225)
     1 27.6%  (87) 28.7%  (89) 26.0%  (79)
Code
# Recurrence
colon_surv %>% 
  tabyl(recurrence, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 recurrence         Obs         Lev     Lev+5FU
          0 43.8% (138) 44.5% (138) 60.9% (185)
          1 56.2% (177) 55.5% (172) 39.1% (119)
Code
colon_surv %>% 
  tabyl(recurrence, status) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 recurrence           0           1
          0 88.7% (423)  8.4%  (38)
          1 11.3%  (54) 91.6% (414)
Code
# time from surgery to registration
colon_surv %>% 
  tabyl(surg, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 surg         Obs         Lev     Lev+5FU
    0 71.1% (224) 74.2% (230) 75.0% (228)
    1 28.9%  (91) 25.8%  (80) 25.0%  (76)


Summary statistics grouped by treatment

Code
summary_table <- colon_surv%>%group_by(rx)%>%summarise(count =n(),
                                                       male = sum(sex),
                                                       median_age = median(age),
                                                       ct_perforation = sum(perfor),
                                                       ct_adherence_nerby_organ = sum(adhere),
                                                       ct_death = sum(death),
                                                       perc_male = (male/count)*100,
                                                       iqr_age = IQR(age))
summary_table
# A tibble: 3 × 9
  rx      count  male median_age ct_perforation ct_adherence_nerby_or…¹ ct_death
  <fct>   <int> <dbl>      <dbl>          <dbl>                   <dbl>    <dbl>
1 Obs       315   166         60              9                      47      168
2 Lev       310   177         61             10                      49      161
3 Lev+5FU   304   141         62              8                      39      123
# ℹ abbreviated name: ¹​ct_adherence_nerby_organ
# ℹ 2 more variables: perc_male <dbl>, iqr_age <dbl>
Code
g <- colon_surv%>%filter(rx == "Lev+5FU")
summary(g$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   26.0    52.0    62.0    59.7    70.0    81.0 

Insight: Each treatment group had about 300 participants. Median age, number of participants with perforation and adherence are similar between the three groups. Whereas the total number of death are higher in the Lev+5FU group.

II. Table 1: Description of the study population

O bservation (%) Amisole (%) Amisole + 5-FU (%)
N=315 N=310 N=304
D e mographics
Male 166 (52.3) 177 (57.1) 141
Median age (years) [IQR] 60 [53,68] 61 [53,69] 61 [52,70]
Cancer cha r acterstics
Colon o bstruction 63 (20.0) 63 (20.3) 54 (17.8)
Colon p erforation 9 (2.9) 10 (3.2) 8 (2.6)
Adherence to nearby organs 47 (14.9) 49 (15.8) 39 (12.8)
Diff e rentiation of tumor
Well 27 (8.6) 37 (11.9) 29 (9.5)
Moderate 236 (74.9) 229 (73.9) 221 (72.7)
Poor 52 (16.5) 44 (14.2) 54 (17.8)
Extent of local spread
Contiguous 20 (6.3) 12 (3.9) 11 (3.6)
Muscle 38 (12.1) 36 (11.6) 32 (10.5)
Serosa 249 (79.0) 259 (83.5) 251 (82.6)
Submucosa 8 (2.5) 3 (1.0) 10 (3.3)
More than 4 lymh nodes with cancer Yes 87 (27.6) 89 (28.7) 79 (26.0)
Recurrence (%) Yes 177 (56.2) 172 (55.5) 119 (39.1)
Short time from surgery to r e gistration (%) Yes 91 (28.9) 80 (25.8) 76 (25.0)

III. Methods

The Cox proportional hazards model was used to model the relationship between survival time and different lung cancer treatments. In particlular the survival time will be compared between the untreated group (observation) vs. those treated with amisole (Lev), or amisole + 5-FU. The Cox regression model was chosen for this study because it is useful for studying association between survival time of patients and predictors and allows estimating the relative risk or hazard ratios due to the covariates, i.e., treatment status. The time (in days) until event, i.e, death, will be modeled as a function of treatment and other variables, including age, sex, and various tumor characteristics. Significant predictors were included in the final model.

Statistical analysis

The R statistical software version 4.3.2 [1] was used for all analysis. The Survival package was used to construct the Cox regression model [2] [3].

Cox regression model is based on the hazard function \(h_x(t)\) with covariates at time t given by:

\(h_x(t)=h_0(t)\exp(\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_p)\)

Where:

\(h_x(t)\) is the hazard function

\(h_0(t)\) is the baseline hazard function

\(\beta_1x_1 + \beta_2x_2 + \dots +\beta_p x_p\) represent the linear combination of covariates and their coefficient

The hazards for the observation vs. amisole (Lev), or amisole + 5-FU group with covariate values x1 and x2 are given by: \(hx_1(t)=h_0(t)\exp(\beta_1x_1)\) and \(hx_2(t)=h_0(t)\exp(\beta_2x_2)\), respectively

The hazard ratio is expressed as: HR = \(hx_2(t)\) / \(hx_1(t)\) = \(\exp[\beta(x_2-x_1)]\)

The Schoenfeld residual plot was constructed to test Cox proportional hazards assumption. When the proportional hazards assumpiton was not met for any of the covariates, stratification approach was explored. The Survminer [4] package was used to plot the Kaplan-Meier curve to visualize the survival probability over time for each treatment group.

Multicolinearity was tested using Variant Inflation Factor (VIF) calculated using MASS package [5].

The R MASS package was used for Stepwise model selection, using “both” forward and backward variable selection [5]. For Stepwise selection, stepAIC() function uses AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model. Model performance was evaluated using 100-fold cross-validation using the boot package [6] [7].

IV. Analysis: Cox regression model

Plot survival times

Code
# Create new incremental count id
colon_surv$idcount <- c(1:length(colon_surv$id))

# Order by survival time and create an order variable:
colon_surv <- colon_surv[order(-colon_surv$time, colon_surv$status),]
colon_surv$order <- c(1:length(colon_surv$idcount))

ggplot(data=colon_surv, aes(x=time, y=order)) +
geom_rect(xmin=23,xmax=colon_surv$time,ymin=colon_surv$order,ymax=colon_surv$order+1, colour="lightgray") +
geom_rect(xmin=colon_surv$time-2,xmax=colon_surv$time,ymin=colon_surv$order,ymax=colon_surv$order+1,
          fill=factor(colon_surv$status+1)) +
geom_vline(xintercept= 1976,linetype="solid") +
scale_x_continuous(breaks=seq(20,3330,650)) +
geom_text(aes(2600, 750, label="Median Survival Time")) +
xlab("Survival Time (Days)") + ylab("Participants (ordered by survival time)") +
ggtitle("Survival Times for Participant") +
theme_classic() +
theme(legend.position="none",
      panel.grid.major=element_blank(),
      panel.grid.minor=element_blank(),
      panel.background=element_blank(),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"))

Plot survival curve stratified by treatment group

Code
library(survminer)
library(survival)

fit <- survfit(Surv(time,status) ~ rx, data = colon_surv)
ggsurvplot(fit, data=colon_surv, risk.table = TRUE)

Code
colon_surv%>%group_by(rx)%>%summarise(t= median(time))
# A tibble: 3 × 2
  rx          t
  <fct>   <dbl>
1 Obs      1856
2 Lev      1882
3 Lev+5FU  2100

Insight: The time for 50% Survival probability of the group treated with Lev+5Fu is over 3000 days while the survival time for the observation and Lev group is around 2000 days

Fit base Model

Code
m0 <- coxph(Surv(time, status) ~ 1, data = colon_surv)
summary(m0)
Call:  coxph(formula = Surv(time, status) ~ 1, data = colon_surv)

Null model
  log likelihood= -2930.192 
  n= 929 

Perform cox regression to test whether treatment is significant predictor

Code
#| echo: true
#| message: false
#| warning: false

# Univariate analysis
m1 <- coxph(Surv(time, status) ~ rx, data = colon_surv)
summary(m1)
Call:
coxph(formula = Surv(time, status) ~ rx, data = colon_surv)

  n= 929, number of events= 452 

              coef exp(coef) se(coef)      z Pr(>|z|)   
rxLev     -0.02664   0.97371  0.11030 -0.241  0.80917   
rxLev+5FU -0.37171   0.68955  0.11875 -3.130  0.00175 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
rxLev        0.9737      1.027    0.7844    1.2087
rxLev+5FU    0.6896      1.450    0.5464    0.8703

Concordance= 0.536  (se = 0.013 )
Likelihood ratio test= 12.15  on 2 df,   p=0.002
Wald test            = 11.56  on 2 df,   p=0.003
Score (logrank) test = 11.68  on 2 df,   p=0.003

Insight: the coefficient of Lev is not significant, suggesting that there is no evidence that this treatment affects survival time compared to observation. however Lev+5Fu is significant (p=0.00175), indicating that the treatment Lev +5Fu has a statistically significant effect on survival time compared to the reference group. The negative sign indicates that this treatment group has a lower hazard and likely a longer survival time.

The hazard ratio for Lex+5FU (0.690), indicating the risk of death is about 31% lower compared to the observation group.

The p-values indicate that the model is significant.

Test the Cox proportional hazard assumption

Code
cox.zph(m1)
       chisq df    p
rx      1.48  2 0.48
GLOBAL  1.48  2 0.48
Code
zph_test <- cox.zph(m1)

print(zph_test)
       chisq df    p
rx      1.48  2 0.48
GLOBAL  1.48  2 0.48
Code
# plot the Schoenfeld residuals
plot(zph_test)

Insight: The Schoenfeld residal plot shows that the residuals are scattered randomly and the smooth trend line is horizontal near 0. This suggests that the hazard ratio for rx (treatment status) is constant over time and the proportional hazard assumption is met. The global p-value is >0.05, indicating that the the assumption is met.

Perform multivariate analysis, including all variables to determine which predictors are significant.

Code
# Subset data for modeling
df <- colon_surv%>%dplyr::select(!c(id,study,etype,death,differ,recurrence, extent,surg_to_reg_time, idcount, order, nodes))

# multivariant analysis
m2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
              local_spread, data = df)  
summary(m2)
Call:
coxph(formula = Surv(time, status) ~ rx + age + sex + perfor + 
    adhere + surg + obstruct + differentiation + node4 + local_spread, 
    data = df)

  n= 929, number of events= 452 

                            coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxLev                 -0.0160445  0.9840835  0.1115300 -0.144 0.885612    
rxLev+5FU             -0.3742695  0.6877915  0.1196533 -3.128 0.001760 ** 
age                    0.0070459  1.0070708  0.0040596  1.736 0.082633 .  
sex                    0.0412783  1.0421421  0.0952633  0.433 0.664791    
perfor                 0.0005371  1.0005373  0.2695697  0.002 0.998410    
adhere                 0.1689086  1.1840119  0.1295242  1.304 0.192210    
surg                   0.2362177  1.2664500  0.1032961  2.287 0.022207 *  
obstruct               0.2865577  1.3318350  0.1173103  2.443 0.014576 *  
differentiationpoor    0.3579964  1.4304604  0.1222148  2.929 0.003398 ** 
differentiationwell    0.0812878  1.0846830  0.1662752  0.489 0.624930    
node4                  0.9353140  2.5480134  0.0988712  9.460  < 2e-16 ***
local_spreadmuscle    -0.9503933  0.3865890  0.2554577 -3.720 0.000199 ***
local_spreadserosa    -0.4526719  0.6359268  0.1986566 -2.279 0.022687 *  
local_spreadsubmucosa -1.2432513  0.2884449  0.5396100 -2.304 0.021224 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    0.9841     1.0162    0.7909    1.2245
rxLev+5FU                0.6878     1.4539    0.5440    0.8696
age                      1.0071     0.9930    0.9991    1.0151
sex                      1.0421     0.9596    0.8646    1.2561
perfor                   1.0005     0.9995    0.5899    1.6970
adhere                   1.1840     0.8446    0.9186    1.5262
surg                     1.2664     0.7896    1.0343    1.5507
obstruct                 1.3318     0.7508    1.0583    1.6761
differentiationpoor      1.4305     0.6991    1.1258    1.8176
differentiationwell      1.0847     0.9219    0.7830    1.5026
node4                    2.5480     0.3925    2.0991    3.0929
local_spreadmuscle       0.3866     2.5867    0.2343    0.6378
local_spreadserosa       0.6359     1.5725    0.4308    0.9387
local_spreadsubmucosa    0.2884     3.4669    0.1002    0.8306

Concordance= 0.674  (se = 0.012 )
Likelihood ratio test= 147  on 14 df,   p=<2e-16
Wald test            = 150.3  on 14 df,   p=<2e-16
Score (logrank) test = 161.3  on 14 df,   p=<2e-16

Insight: When all variables are included in the model, age, obstruction, differentiation, and recurrence were significant predictor. However, treatment is no longer significant.

Calculate Variance Inflation Factor (VIF) to assess multicollinearity among predictors

Code
vif <- vif(m2)
print(vif)
                    GVIF Df GVIF^(1/(2*Df))
rx              1.035127  2        1.008668
age             1.042883  1        1.021217
sex             1.022620  1        1.011247
perfor          1.053420  1        1.026363
adhere          1.092721  1        1.045333
surg            1.013876  1        1.006914
obstruct        1.055334  1        1.027295
differentiation 1.062714  2        1.015323
node4           1.046133  1        1.022806
local_spread    1.091169  3        1.014648

Insight: None of the variables have VIF values above 5, therefore there is no multicollinearity

Variable selection method 1. Model survival while including different cancer characteristics as predictors separately to identify significance predictors.

Code
m2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
              local_spread, data = df)  

# Treatment
m2a <- coxph(Surv(time, status) ~ rx, data = colon_surv) # significant
summary(m2a)
Call:
coxph(formula = Surv(time, status) ~ rx, data = colon_surv)

  n= 929, number of events= 452 

              coef exp(coef) se(coef)      z Pr(>|z|)   
rxLev     -0.02664   0.97371  0.11030 -0.241  0.80917   
rxLev+5FU -0.37171   0.68955  0.11875 -3.130  0.00175 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
rxLev        0.9737      1.027    0.7844    1.2087
rxLev+5FU    0.6896      1.450    0.5464    0.8703

Concordance= 0.536  (se = 0.013 )
Likelihood ratio test= 12.15  on 2 df,   p=0.002
Wald test            = 11.56  on 2 df,   p=0.003
Score (logrank) test = 11.68  on 2 df,   p=0.003
Code
# Demographics
m2b <- coxph(Surv(time, status) ~ age + sex, data = colon_surv) # not significant
summary(m2b)
Call:
coxph(formula = Surv(time, status) ~ age + sex, data = colon_surv)

  n= 929, number of events= 452 

        coef exp(coef) se(coef)     z Pr(>|z|)
age 0.001951  1.001952 0.004031 0.484    0.628
sex 0.012955  1.013040 0.094205 0.138    0.891

    exp(coef) exp(-coef) lower .95 upper .95
age     1.002     0.9981    0.9941     1.010
sex     1.013     0.9871    0.8422     1.218

Concordance= 0.511  (se = 0.014 )
Likelihood ratio test= 0.26  on 2 df,   p=0.9
Wald test            = 0.25  on 2 df,   p=0.9
Score (logrank) test = 0.25  on 2 df,   p=0.9
Code
# cancer characterstics
m2c <- coxph(Surv(time, status) ~ perfor + adhere + obstruct, data = colon_surv) # adhere and obstruct are significant
summary(m2c)
Call:
coxph(formula = Surv(time, status) ~ perfor + adhere + obstruct, 
    data = colon_surv)

  n= 929, number of events= 452 

             coef exp(coef) se(coef)      z Pr(>|z|)  
perfor   -0.03415   0.96643  0.26932 -0.127   0.8991  
adhere    0.31011   1.36358  0.12572  2.467   0.0136 *
obstruct  0.25794   1.29426  0.11547  2.234   0.0255 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
perfor      0.9664     1.0347    0.5701     1.638
adhere      1.3636     0.7334    1.0658     1.745
obstruct    1.2943     0.7726    1.0321     1.623

Concordance= 0.536  (se = 0.012 )
Likelihood ratio test= 10.82  on 3 df,   p=0.01
Wald test            = 11.51  on 3 df,   p=0.009
Score (logrank) test = 11.6  on 3 df,   p=0.009
Code
# Differentiation of tumor
m2d <- coxph(Surv(time, status) ~ differentiation, data = colon_surv) # significant
summary(m2d)
Call:
coxph(formula = Surv(time, status) ~ differentiation, data = colon_surv)

  n= 929, number of events= 452 

                        coef exp(coef) se(coef)      z Pr(>|z|)    
differentiationpoor  0.48265   1.62036  0.12040  4.009  6.1e-05 ***
differentiationwell -0.05027   0.95097  0.16408 -0.306    0.759    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                    exp(coef) exp(-coef) lower .95 upper .95
differentiationpoor     1.620     0.6171    1.2798     2.052
differentiationwell     0.951     1.0516    0.6895     1.312

Concordance= 0.544  (se = 0.012 )
Likelihood ratio test= 15.34  on 2 df,   p=5e-04
Wald test            = 16.97  on 2 df,   p=2e-04
Score (logrank) test = 17.31  on 2 df,   p=2e-04
Code
# Extent of local spread
m2f <- coxph(Surv(time, status) ~ local_spread, data = colon_surv) # significant
summary(m2f)
Call:
coxph(formula = Surv(time, status) ~ local_spread, data = colon_surv)

  n= 929, number of events= 452 

                         coef exp(coef) se(coef)      z Pr(>|z|)    
local_spreadmuscle    -1.0892    0.3365   0.2498 -4.361 1.29e-05 ***
local_spreadserosa    -0.5043    0.6039   0.1927 -2.617  0.00888 ** 
local_spreadsubmucosa -1.7283    0.1776   0.5336 -3.239  0.00120 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
local_spreadmuscle       0.3365      2.972    0.2062    0.5490
local_spreadserosa       0.6039      1.656    0.4139    0.8811
local_spreadsubmucosa    0.1776      5.631    0.0624    0.5054

Concordance= 0.552  (se = 0.009 )
Likelihood ratio test= 29.21  on 3 df,   p=2e-06
Wald test            = 25.35  on 3 df,   p=1e-05
Score (logrank) test = 26.94  on 3 df,   p=6e-06
Code
# Recurrence
# m2g <- coxph(Surv(time, status) ~ recurrence, data = colon_surv) # significant
# summary(m2g)

# short time from surgery to registration
m2h <- coxph(Surv(time, status) ~ surg, data = colon_surv) # significant
summary(m2h)
Call:
coxph(formula = Surv(time, status) ~ surg, data = colon_surv)

  n= 929, number of events= 452 

       coef exp(coef) se(coef)     z Pr(>|z|)  
surg 0.2333    1.2627   0.1026 2.274   0.0229 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     exp(coef) exp(-coef) lower .95 upper .95
surg     1.263     0.7919     1.033     1.544

Concordance= 0.523  (se = 0.011 )
Likelihood ratio test= 5.01  on 1 df,   p=0.03
Wald test            = 5.17  on 1 df,   p=0.02
Score (logrank) test = 5.2  on 1 df,   p=0.02
Code
# include significant predictors in final model
model <- coxph(Surv(time, status) ~ rx + adhere + surg + obstruct + differentiation
              + local_spread, data = colon_surv)

summary(model)
Call:
coxph(formula = Surv(time, status) ~ rx + adhere + surg + obstruct + 
    differentiation + local_spread, data = colon_surv)

  n= 929, number of events= 452 

                            coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxLev                 -0.0047151  0.9952960  0.1110663 -0.042 0.966137    
rxLev+5FU             -0.3588384  0.6984872  0.1191372 -3.012 0.002596 ** 
adhere                 0.1499547  1.1617816  0.1285420  1.167 0.243380    
surg                   0.2103485  1.2341081  0.1031988  2.038 0.041521 *  
obstruct               0.2000040  1.2214077  0.1148937  1.741 0.081723 .  
differentiationpoor    0.4507752  1.5695283  0.1217364  3.703 0.000213 ***
differentiationwell   -0.0003827  0.9996174  0.1651661 -0.002 0.998151    
local_spreadmuscle    -0.9927505  0.3705561  0.2551148 -3.891 9.97e-05 ***
local_spreadserosa    -0.4291507  0.6510618  0.1987757 -2.159 0.030853 *  
local_spreadsubmucosa -1.5035698  0.2223351  0.5385347 -2.792 0.005239 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    0.9953     1.0047   0.80059    1.2373
rxLev+5FU                0.6985     1.4317   0.55303    0.8822
adhere                   1.1618     0.8607   0.90304    1.4947
surg                     1.2341     0.8103   1.00812    1.5108
obstruct                 1.2214     0.8187   0.97513    1.5299
differentiationpoor      1.5695     0.6371   1.23637    1.9925
differentiationwell      0.9996     1.0004   0.72318    1.3817
local_spreadmuscle       0.3706     2.6986   0.22475    0.6110
local_spreadserosa       0.6511     1.5360   0.44099    0.9612
local_spreadsubmucosa    0.2223     4.4977   0.07738    0.6389

Concordance= 0.618  (se = 0.013 )
Likelihood ratio test= 63.62  on 10 df,   p=7e-10
Wald test            = 60.86  on 10 df,   p=2e-09
Score (logrank) test = 63.16  on 10 df,   p=9e-10

Test proportional hazard assumption for the model with significant predictors

Code
cox.zph(model)
                  chisq df       p
rx               1.8302  2    0.40
adhere           0.1401  1    0.71
surg             0.0596  1    0.81
obstruct         6.6401  1    0.01
differentiation 20.0601  2 4.4e-05
local_spread     6.7511  3    0.08
GLOBAL          35.5839 10 9.9e-05
Code
zph_test <- cox.zph(m2)

print(zph_test)
                  chisq df       p
rx               2.4436  2 0.29470
age              0.5116  1 0.47444
sex              0.9111  1 0.33981
perfor           0.6792  1 0.40987
adhere           0.1074  1 0.74310
surg             0.0281  1 0.86691
obstruct         6.1995  1 0.01278
differentiation 17.7305  2 0.00014
node4            5.7169  1 0.01680
local_spread     7.1050  3 0.06862
GLOBAL          40.5778 14 0.00021
Code
# plot the Schoenfeld residuals
plot(zph_test)

Insight: Differentiation variable does not meet proportional hazards assumption.

Variable selection method 2: Use the MASS package stepAIC() function for stepwise selection by using AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model.

Code
library(MASS)       # for stepwise regression
library(boot)       # for cross-validation

Perform Stepwise variable selection

Code
# stepwise selection
stepwise_model <- stepAIC(m2, direction = "both")
Start:  AIC=5741.4
Surv(time, status) ~ rx + age + sex + perfor + adhere + surg + 
    obstruct + differentiation + node4 + local_spread

                  Df    AIC
- perfor           1 5739.4
- sex              1 5739.6
- adhere           1 5741.0
<none>               5741.4
- age              1 5742.5
- surg             1 5744.5
- obstruct         1 5745.1
- differentiation  2 5745.4
- rx               2 5749.9
- local_spread     3 5753.1
- node4            1 5822.0

Step:  AIC=5739.4
Surv(time, status) ~ rx + age + sex + adhere + surg + obstruct + 
    differentiation + node4 + local_spread

                  Df    AIC
- sex              1 5737.6
- adhere           1 5739.1
<none>               5739.4
- age              1 5740.5
+ perfor           1 5741.4
- surg             1 5742.5
- obstruct         1 5743.2
- differentiation  2 5743.5
- rx               2 5747.9
- local_spread     3 5751.1
- node4            1 5820.1

Step:  AIC=5737.59
Surv(time, status) ~ rx + age + adhere + surg + obstruct + differentiation + 
    node4 + local_spread

                  Df    AIC
- adhere           1 5737.3
<none>               5737.6
- age              1 5738.6
+ sex              1 5739.4
+ perfor           1 5739.6
- surg             1 5740.7
- obstruct         1 5741.2
- differentiation  2 5741.7
- rx               2 5746.2
- local_spread     3 5749.3
- node4            1 5818.2

Step:  AIC=5737.26
Surv(time, status) ~ rx + age + surg + obstruct + differentiation + 
    node4 + local_spread

                  Df    AIC
<none>               5737.3
+ adhere           1 5737.6
- age              1 5738.6
+ sex              1 5739.1
+ perfor           1 5739.2
- surg             1 5740.7
- obstruct         1 5740.9
- differentiation  2 5742.1
- rx               2 5746.0
- local_spread     3 5750.4
- node4            1 5817.4
Code
summary(stepwise_model)
Call:
coxph(formula = Surv(time, status) ~ rx + age + surg + obstruct + 
    differentiation + node4 + local_spread, data = df)

  n= 929, number of events= 452 

                           coef exp(coef)  se(coef)      z Pr(>|z|)    
rxLev                 -0.010789  0.989269  0.111379 -0.097  0.92283    
rxLev+5FU             -0.375746  0.686776  0.119575 -3.142  0.00168 ** 
age                    0.007366  1.007393  0.004051  1.818  0.06901 .  
surg                   0.243871  1.276179  0.103202  2.363  0.01813 *  
obstruct               0.283165  1.327324  0.116138  2.438  0.01476 *  
differentiationpoor    0.373783  1.453222  0.121599  3.074  0.00211 ** 
differentiationwell    0.069022  1.071459  0.165690  0.417  0.67699    
node4                  0.929854  2.534140  0.098507  9.440  < 2e-16 ***
local_spreadmuscle    -0.995556  0.369518  0.252106 -3.949 7.85e-05 ***
local_spreadserosa    -0.501169  0.605822  0.194154 -2.581  0.00984 ** 
local_spreadsubmucosa -1.322018  0.266597  0.536289 -2.465  0.01370 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    0.9893     1.0108   0.79526    1.2306
rxLev+5FU                0.6868     1.4561   0.54329    0.8682
age                      1.0074     0.9927   0.99943    1.0154
surg                     1.2762     0.7836   1.04248    1.5623
obstruct                 1.3273     0.7534   1.05711    1.6666
differentiationpoor      1.4532     0.6881   1.14506    1.8443
differentiationwell      1.0715     0.9333   0.77436    1.4826
node4                    2.5341     0.3946   2.08921    3.0738
local_spreadmuscle       0.3695     2.7062   0.22545    0.6057
local_spreadserosa       0.6058     1.6506   0.41408    0.8864
local_spreadsubmucosa    0.2666     3.7510   0.09319    0.7627

Concordance= 0.672  (se = 0.012 )
Likelihood ratio test= 145.1  on 11 df,   p=<2e-16
Wald test            = 149.1  on 11 df,   p=<2e-16
Score (logrank) test = 159.7  on 11 df,   p=<2e-16
Code
# FINAL MODEL

m2_select <- coxph(formula = Surv(time, status) ~ rx + age + surg + obstruct + 
    differentiation + node4 + local_spread, data = df)

summary(m2_select)
Call:
coxph(formula = Surv(time, status) ~ rx + age + surg + obstruct + 
    differentiation + node4 + local_spread, data = df)

  n= 929, number of events= 452 

                           coef exp(coef)  se(coef)      z Pr(>|z|)    
rxLev                 -0.010789  0.989269  0.111379 -0.097  0.92283    
rxLev+5FU             -0.375746  0.686776  0.119575 -3.142  0.00168 ** 
age                    0.007366  1.007393  0.004051  1.818  0.06901 .  
surg                   0.243871  1.276179  0.103202  2.363  0.01813 *  
obstruct               0.283165  1.327324  0.116138  2.438  0.01476 *  
differentiationpoor    0.373783  1.453222  0.121599  3.074  0.00211 ** 
differentiationwell    0.069022  1.071459  0.165690  0.417  0.67699    
node4                  0.929854  2.534140  0.098507  9.440  < 2e-16 ***
local_spreadmuscle    -0.995556  0.369518  0.252106 -3.949 7.85e-05 ***
local_spreadserosa    -0.501169  0.605822  0.194154 -2.581  0.00984 ** 
local_spreadsubmucosa -1.322018  0.266597  0.536289 -2.465  0.01370 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    0.9893     1.0108   0.79526    1.2306
rxLev+5FU                0.6868     1.4561   0.54329    0.8682
age                      1.0074     0.9927   0.99943    1.0154
surg                     1.2762     0.7836   1.04248    1.5623
obstruct                 1.3273     0.7534   1.05711    1.6666
differentiationpoor      1.4532     0.6881   1.14506    1.8443
differentiationwell      1.0715     0.9333   0.77436    1.4826
node4                    2.5341     0.3946   2.08921    3.0738
local_spreadmuscle       0.3695     2.7062   0.22545    0.6057
local_spreadserosa       0.6058     1.6506   0.41408    0.8864
local_spreadsubmucosa    0.2666     3.7510   0.09319    0.7627

Concordance= 0.672  (se = 0.012 )
Likelihood ratio test= 145.1  on 11 df,   p=<2e-16
Wald test            = 149.1  on 11 df,   p=<2e-16
Score (logrank) test = 159.7  on 11 df,   p=<2e-16
Code
cox_summary <- tidy(m2_select)

Cross-Validation for Model Evaluation

Code
# cross validation function
cv_cox <- function(data, indices) {
  # subset data
  d <- data[indices, ]
  
  # fit model on subset
  fit <- coxph(Surv(time, status) ~ rx + surg + obstruct + differentiation + node4 +
    local_spread, data = df)
  
  # return the concordance index (C-index)
  return(survConcordance(Surv(time, status) ~ predict(fit), data = d)$concordance)
}

# Set seed for reproducibility
set.seed(45)

# perform 100-fold cross-validation
cv_results <- boot(data = df, statistic = cv_cox, R = 100)

cv_results

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = df, statistic = cv_cox, R = 100)


Bootstrap Statistics :
     original     bias    std. error
t1* 0.6691634 -0.1682034  0.01330488

Visualization of cross-validation results: plot concordance index from bootstrap samples

Code
# Print summary of cross-validation results
median(cv_results$t)
[1] 0.5018784
Code
# Visualize results

plot(cv_results)

Insight: The concordance index (c-index) of the original model (0.67) is higher than the median bootstrap samples concordance (0.5), suggesting overestimation of of model performance on the original dataset. The model predictive power is not needs improvement based on the c-index.

V. Results

Table 2. Unadjusted model: Survival after Chemotherapy for stage B/C Colon Cancer

Treatment Coefficient Hazard ratio 95% CI_upper 95% CI_lower P-value
Amisole (Lev) -0.027 0.974 0.784 1.209 0.809
Amisole + 5-FU -0.372 0.690 0.546 0.870 0.002

Table 3. Adjusted model: Survival after Chemotherapy for stage B/C Colon Cancer

T reatment Coe fficient Hazard ratio 95% CI_upper 95% CI_lower P-value
Amisole (Lev) -0.011 0.989 0.795 1.231 0.923
A misole + 5-FU -0.376 0.687 0.543 0.868 0.002
Age 0.007 1.007 0.999 1.015 0.069
Surge 0.244 1.276 1.042 1.562 0.018
Obs truction of colon 0.283 1.327 1.057 1.667 0.015
Diff erentiat ion_poor 0.374 1.453 1.145 1.844 0.002
Diff erentiat ion_well 0.069 1.072 0.774 1.483 0.677
More than 4 nodes (+) 0.930 2.534 2.089 3.074 3.745 x 10^21
Local sprea d_muscle -0.996 0.370 0.225 0.606 7.849 x 10^5
Local sprea d_serosa -0.501 0.606 0.414 0.886 0.010
Local spread_s ubmucosa -1.322 0.267 0.093 0.763 0.014

References

[1]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023. Available: https://www.R-project.org/
[2]
T. M. Therneau, A package for survival analysis in r. 2024. Available: https://CRAN.R-project.org/package=survival
[3]
Terry M. Therneau and Patricia M. Grambsch, Modeling survival data: Extending the Cox model. New York: Springer, 2000.
[4]
A. Kassambara, M. Kosinski, and P. Biecek, Survminer: Drawing survival curves using ’ggplot2’. 2021. Available: https://CRAN.R-project.org/package=survminer
[5]
W. N. Venables and B. D. Ripley, Modern applied statistics with s, Fourth. New York: Springer, 2002. Available: https://www.stats.ox.ac.uk/pub/MASS4/
[6]
A. C. Davison and D. V. Hinkley, Bootstrap methods and their applications. Cambridge: Cambridge University Press, 1997. Available: doi:10.1017/CBO9780511802843
[7]
Angelo Canty and B. D. Ripley, Boot: Bootstrap r (s-plus) functions. 2024.